DID with Staggered Adoption

Lee Kennedy-Shaffer, PhD

2024-06-18

Motivating Example: COVID-19 Vaccine Mandates

Question

What happens when multiple units adopt the intervention at different times?

Example

Different states adopted COVID-19 vaccine mandates for state employees at different times.

Staggered Adoption

Panel Data Setting

  • Multiple units, treated at different time points

  • Multiple time points

Schematic of multiple units observed over time, with transition from white squares (representing control) to green squares (representing intervention) over time

Variety of 2x2 Comparisons

Staggered treatment adoption schematic with a 2x2 comparison of a late vs. early treatment adopter highlighted

Variety of 2x2 Comparisons

Staggered treatment adoption schematic with 2x2 comparison of early vs. late treatment adopter

Two-Way Fixed Effects Model

Caution

The TWFE model can accommodate this:

\[ Y_{it} = \alpha_i + \gamma_t + \theta I(X_{it} = 1)+\epsilon_{it} \]

But as written it assumes treatment effect homogeneity across time periods, time-on-treatment, and units.
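
For intuition, in a balanced panel the TWFE estimate of \(\theta\) has a closed form via double-demeaning. This is the standard Frisch-Waugh-Lovell result rather than anything specific to these slides, and the tilde and bar notation is introduced here only for illustration, writing \(X_{it}\) for the 0/1 treatment indicator:

\[ \hat{\theta} = \frac{\sum_{i,t} \tilde{X}_{it} \tilde{Y}_{it}}{\sum_{i,t} \tilde{X}_{it}^2}, \qquad \tilde{X}_{it} = X_{it} - \bar{X}_{i\cdot} - \bar{X}_{\cdot t} + \bar{X}_{\cdot\cdot}, \]

with \(\tilde{Y}_{it}\) defined analogously. Each observation contributes in proportion to \(\tilde{X}_{it}\), so a treated observation with \(\tilde{X}_{it} < 0\) enters with a negative sign.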

Challenge: Heterogeneity

Treatment Effect Heterogeneity

Heterogeneity is common in many settings, particularly in epidemiology and when adoption is non-randomized.

Question

What might cause heterogeneity in the effect of a state employee COVID-19 vaccine mandate?

Citation for Goodman-Bacon (2021)

Weighted Average of Effects

Line plot (schematic) of outcomes over time for two units treated at different times with different responses to intervention and one untreated unit.

Goodman-Bacon (2021), Figure 1.

Time-Varying Effects

A line plot where two lines start coinciding in the first period, diverge in the second period as one grows faster than the other, and then become parallel in the third period.

Goodman-Bacon (2021), Figure 3.

Forbidden Comparisons

The weights on treatment effects can be non-convex (i.e., some can be negative) if either of the following is true:

  • There are time-varying treatment effects

  • There are heterogeneous treatment effects across timing groups

This gives an uninterpretable estimand, and can even switch the sign of the estimate.
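
A small numerical sketch of the sign switch, assuming a noiseless, balanced two-unit panel with no never-treated unit. The helper `twfe_estimate` and all the numbers are hypothetical; it computes the TWFE coefficient by double-demeaning, which matches the fixed-effects regression only for a balanced panel with a single regressor.

```python
def twfe_estimate(y, d):
    """TWFE coefficient on a 0/1 treatment in a balanced panel.

    y, d: lists of lists indexed [unit][period].
    Uses the double-demeaned (within) form of the regression.
    """
    n, T = len(y), len(y[0])

    def double_demean(m):
        row = [sum(r) / T for r in m]
        col = [sum(m[i][t] for i in range(n)) / n for t in range(T)]
        grand = sum(row) / n
        return [[m[i][t] - row[i] - col[t] + grand for t in range(T)]
                for i in range(n)]

    yt, dt = double_demean(y), double_demean(d)
    num = sum(yt[i][t] * dt[i][t] for i in range(n) for t in range(T))
    den = sum(dt[i][t] ** 2 for i in range(n) for t in range(T))
    return num / den


# Treatment status: an early adopter and a late adopter over three periods.
d = [[0, 1, 1],
     [0, 0, 1]]

# Hypothetical outcomes with no unit or period effects and no noise.
# The effect is 1 in the first treated period and 5 in the second,
# so every treated cell has a positive effect (1, 5, and 1).
y = [[0.0, 1.0, 5.0],
     [0.0, 0.0, 1.0]]

theta_hat = twfe_estimate(y, d)      # approximately -1
true_avg = (1.0 + 5.0 + 1.0) / 3     # average effect over treated cells
print(theta_hat, true_avg)
```

In this toy panel every treated cell has a positive effect, yet the TWFE estimate is negative, because the early adopter's later, larger effect is differenced out when that unit serves as the comparison for the late adopter.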

Goodman-Bacon Decomposition

The TWFE model estimates a weighted average of all 2x2 DID comparisons.

A scatterplot of 2x2 DID estimates, ranging from -30 to 30, by weights, ranging from 0 to 1.2.

Goodman-Bacon (2021), Figure 6.

Goodman-Bacon Decomposition

We can also observe the overall weight given to each treatment timing group, which may be negative if it is more often used as a control than a treated group.

Scatter plot of weights for each treatment group, ranging from -0.15 to 0.30, by treatment time, ranging from 1969 to 1985.

Goodman-Bacon (2021), Figure 7.

Proposed Solutions

Group-Time ATT

Let

\[ ATT(g,t) = E[Y_{it}(g) - Y_{it}(0) \mid G_i = g], \]

the group-time ATT in period \(t\) for units first treated in period \(g\) (i.e., \(G_i = g\)), compared to if they had never been treated (or had not yet been treated by period \(t\)).

Many solutions boil down to considering which group-time ATTs should be included in the estimand, how they differ, and how to weight them.

TWFE assumes \(ATT(g,t) = \theta\) for all \(g \le t\).

Dynamic Specification

\[ Y_{it} = \alpha_i + \gamma_t + \sum_{k \neq 0} \delta_{k} I(K_{it} = k)+\epsilon_{it}, \]

where \(K_{it}\) is the lead/lag for unit \(i\) in period \(t\) (e.g., \(K_{it} = 1\) in the first exposed period).

See Borusyak and Jaravel (2018) and Borusyak et al. (2024). Captures time-on-treatment heterogeneity.

Also useful for testing for “pre-trends” in the setting with a single intervention time.
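
A minimal sketch of how the lead/lag \(K_{it}\) and its indicators might be built, assuming integer period labels and a known first treated period for each unit. The function names are hypothetical, and coding never-treated units with all indicators equal to zero is one possible convention, not something the slides specify.

```python
def event_time(t, g):
    """Lead/lag K_it for period t and first treated period g.

    Returns 1 in the first exposed period, 0 in the last pre-treatment
    period (the omitted reference level), and None if never treated.
    """
    if g is None:
        return None
    return t - g + 1


def event_dummies(t, g, k_values):
    """Indicators I(K_it = k) for each k in k_values, omitting k = 0."""
    k = event_time(t, g)
    return {kv: int(k == kv) for kv in k_values if kv != 0}


# A unit first treated in period 2, observed in period 3 (K_it = 2).
print(event_dummies(3, 2, range(-2, 4)))
```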

Interaction-Weighting

We can account for timing cohort heterogeneity as well by further allowing the effect to vary by adoption timing group (\(G_i\)):

\[ Y_{it} = \alpha_i + \gamma_t + \sum_g \sum_{k \neq 0} \delta_{g,k} I(G_i = g) I(K_{it} = k) + \epsilon_{it} \]

Various methods use this approach, and differ in which comparisons/observations they allow and how they combine results. This implies different assumptions and bias-variance tradeoffs.

Restricting Observations

One approach to avoiding forbidden comparisons is restricting the observations used to fit the model.

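Borusyak et al. (2024) fit the fixed-effects model using only untreated (never-treated and not-yet-treated) observations. They then use the fitted model to impute counterfactual untreated outcomes for the treated observations and compare these with the observed outcomes.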
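
The imputation idea can be sketched as follows. This is a simplified illustration under strong assumptions (no covariates, no noise, every unit and period has at least one untreated observation), not the authors' implementation. The function name, the alternating-means fitting routine, and the toy data are all assumptions made here.

```python
def impute_effects(y, d, n_iter=200):
    """Fit unit and period effects on untreated cells, then impute Y(0).

    y, d: lists of lists indexed [unit][period]; d is 0/1 treatment status.
    Returns {(unit, period): observed minus imputed} for treated cells.
    """
    n, T = len(y), len(y[0])
    alpha, gamma = [0.0] * n, [0.0] * T
    # Alternate between updating unit and period effects (backfitting),
    # using untreated cells only.
    for _ in range(n_iter):
        for i in range(n):
            cells = [t for t in range(T) if d[i][t] == 0]
            alpha[i] = sum(y[i][t] - gamma[t] for t in cells) / len(cells)
        for t in range(T):
            cells = [i for i in range(n) if d[i][t] == 0]
            gamma[t] = sum(y[i][t] - alpha[i] for i in cells) / len(cells)
    return {(i, t): y[i][t] - (alpha[i] + gamma[t])
            for i in range(n) for t in range(T) if d[i][t] == 1}


# Hypothetical panel: unit effects (1, 3, 5), period effects (0, 1, 2, 3),
# and an effect equal to the number of periods on treatment.
d = [[0, 1, 1, 1],   # treated from period index 1
     [0, 0, 1, 1],   # treated from period index 2
     [0, 0, 0, 0]]   # never treated
y = [[1.0, 3.0, 5.0, 7.0],
     [3.0, 4.0, 6.0, 8.0],
     [5.0, 6.0, 7.0, 8.0]]

effects = impute_effects(y, d)
print(effects)
```

The cell-level differences can then be averaged with whatever weights match the target estimand.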

Sun and Abraham (2021) fit the interaction model with a clean “control” cohort \(C\) that is either never treated or last treated. Their regression approach then implicitly weights by the population share in each timing group.

Restricting Periods: First-Difference

Another approach would only consider one switching effect. For each time period \(t\) with at least one unit untreated at \(t-1\) and treated at \(t\) and at least one unit untreated at both \(t-1\) and \(t\), compute:

\[ \begin{aligned} \widehat{DID}_{+,t} = \frac{1}{N_{1,0,t}} &\sum_{i:D_{i,t}=1,D_{i,t-1}=0} \left( Y_{i,t} - Y_{i,t-1} \right) \\ &- \frac{1}{N_{0,0,t}} \sum_{i:D_{i,t}=D_{i,t-1}=0} \left( Y_{i,t} - Y_{i,t-1} \right) \end{aligned} \]

Restricting Periods: First-Difference

Average these switcher estimates across all time periods \(t\), weighted by number of units or individuals.

See de Chaisemartin and d’Haultfoeuille (2020) and de Chaisemartin and d’Haultfoeuille (2023).
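
A minimal sketch of the switch-in calculation, assuming a balanced panel stored as nested lists and weighting each period by its number of switchers (one of several possible weightings). The function names and toy data are hypothetical. This is not the authors' software.

```python
def did_plus(y, d, t):
    """Switch-in DID at period t, or None if it is not defined.

    y, d: lists of lists indexed [unit][period]; d is 0/1 treatment status.
    Returns (estimate, number_of_switchers).
    """
    n = len(y)
    switchers = [i for i in range(n) if d[i][t - 1] == 0 and d[i][t] == 1]
    stayers = [i for i in range(n) if d[i][t - 1] == 0 and d[i][t] == 0]
    if not switchers or not stayers:
        return None

    def mean_change(units):
        return sum(y[i][t] - y[i][t - 1] for i in units) / len(units)

    return mean_change(switchers) - mean_change(stayers), len(switchers)


def switcher_average(y, d):
    """Average the defined DID_{+,t}, weighted by number of switchers."""
    results = [did_plus(y, d, t) for t in range(1, len(y[0]))]
    results = [r for r in results if r is not None]
    total = sum(n_sw for _, n_sw in results)
    return sum(est * n_sw for est, n_sw in results) / total


# Hypothetical panel: the effect equals the number of periods on treatment,
# so the first-period (instantaneous) effect is 1 for both adopters.
d = [[0, 1, 1, 1],
     [0, 0, 1, 1],
     [0, 0, 0, 0]]
y = [[1.0, 3.0, 5.0, 7.0],
     [3.0, 4.0, 6.0, 8.0],
     [5.0, 6.0, 7.0, 8.0]]

estimate = switcher_average(y, d)
print(estimate)
```

Because only the period of switching is used, this sketch recovers the first-period effect and says nothing about later time-on-treatment effects.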

Restricting Periods: Last Pre-Treated

Callaway and Sant’Anna (2021) propose estimating \(\widehat{ATT}_{g,t}\) for each timing group \(g\) and period \(t\) non-parametrically, relative to the last pre-treatment period. Suggested estimators are inverse probability weighting (IPW), outcome regression (OR), and doubly robust (DR).

Then summarize to an overall average effect weighted by \(w_{g,t}\):

\[ \theta = \sum_g \sum_{t=2}^T w_{g,t} ATT_{g,t}. \]
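
As a rough illustration, the simplest unconditional version (no covariates, never-treated comparison group) reduces to a difference in mean changes from the last pre-treatment period. The sketch below assumes that simple case and uses group-size weights over post-treatment cells. The function name, weights, and toy data are hypothetical, and the IPW, OR, and DR estimators are not implemented here.

```python
def att_gt(y, first_treated, g, t):
    """Unconditional group-time ATT estimate against never-treated units.

    y: list of lists indexed [unit][period] (0-based periods).
    first_treated: first treated period per unit, None if never treated.
    Compares changes from the last pre-treatment period, g - 1, to t.
    """
    treated = [i for i, gi in enumerate(first_treated) if gi == g]
    never = [i for i, gi in enumerate(first_treated) if gi is None]
    base = g - 1

    def mean_change(units):
        return sum(y[i][t] - y[i][base] for i in units) / len(units)

    return mean_change(treated) - mean_change(never)


# Hypothetical panel: the effect equals the number of periods on treatment.
first_treated = [1, 2, None]
y = [[1.0, 3.0, 5.0, 7.0],
     [3.0, 4.0, 6.0, 8.0],
     [5.0, 6.0, 7.0, 8.0]]
T = len(y[0])

groups = sorted({g for g in first_treated if g is not None})
atts = {(g, t): att_gt(y, first_treated, g, t)
        for g in groups for t in range(g, T)}

# One possible choice of w_{g,t}: proportional to group size, post-treatment only.
size = {g: first_treated.count(g) for g in groups}
total = sum(size[g] for (g, _) in atts)
theta = sum(size[g] * a for (g, _), a in atts.items()) / total
print(atts, theta)
```

Other choices of \(w_{g,t}\) (for example, by time since adoption or by calendar period) target different summaries of the same building blocks.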

Generalized 2x2 Building Blocks

More generally, we can weight across all 2x2 DID comparisons, with weights chosen to target a specific estimand and then minimize variance.

\[ \hat{\theta} = \sum_{i,i',t,t'} w_{i,i',t,t'} \left[ \left( Y_{i,t'} - Y_{i,t} \right) - \left( Y_{i',t'} - Y_{i',t} \right) \right] \]

See Kennedy-Shaffer (2024).
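
A minimal sketch of the building block and the weighted sum, assuming outcomes stored as nested lists. The weights below are hand-picked for illustration only. Choosing them to target an estimand and minimize variance is the substantive step, and it is not implemented here. All names and data are hypothetical.

```python
def did_2x2(y, i, i_prime, t, t_prime):
    """One 2x2 DID: change for unit i minus change for unit i_prime."""
    return (y[i][t_prime] - y[i][t]) - (y[i_prime][t_prime] - y[i_prime][t])


def weighted_did(y, weights):
    """Weighted sum over blocks keyed by (i, i_prime, t, t_prime)."""
    return sum(w * did_2x2(y, *block) for block, w in weights.items())


# Hypothetical panel: unit 0 adopts at period index 1, unit 1 at index 2,
# unit 2 never adopts. The first-period effect is 1 for both adopters.
y = [[1.0, 3.0, 5.0, 7.0],
     [3.0, 4.0, 6.0, 8.0],
     [5.0, 6.0, 7.0, 8.0]]

# Illustrative weights on two blocks that each compare a new adopter
# with the never-treated unit around its adoption time.
weights = {(0, 2, 0, 1): 0.5,
           (1, 2, 1, 2): 0.5}

theta_hat = weighted_did(y, weights)
print(theta_hat)
```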

Summary of Options

Other Challenges

Other Assumptions and Limitations

The no-anticipation (or known/limited anticipation) assumption still must hold, as must the no-spillover assumption.

All of these approaches change the precise specification of the estimand as well: the ATT must be interpreted in terms of the included time periods, lags, and units, and how they are weighted.

When is DID useful? When is it lacking?

Important

It’s easy to ignore the fundamentals when using the more advanced methods. Consider the validity of the data, the question being asked, and the feasibility of the effect.

Recommendations for Staggered Adoption

  • Consider data source carefully

  • Think about possible heterogeneities and desired estimand

  • Use graphical displays and diagnostics to assess possible biases and trade-offs

  • Consider multiple estimation methods for robustness to different assumptions

  • Pre-specify, and explore with appropriate caveats

Questions